Goal
Build a logistic regression model that predicts whether a user has diabetes. The model will be built with the sklearn.linear_model library and served on the web using Streamlit.
Overview
The data was collected and made available by the "National Institute of Diabetes and Digestive and Kidney Diseases" as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are of Pima Indian heritage (a subgroup of Native Americans) and are females aged 21 and above.
The objective is to predict whether a user is diabetic or not based on certain diagnostic measurements.
Understanding the variables in the dataset
Let's understand the dataset: it has 8 predictor variables and 1 target/dependent variable, "Outcome", which is encoded as 0 (no diabetes) and 1 (diabetes). The columns are explained below:
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg / (height in m)²)
DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
Age: Age (years)
Outcome: Class variable (0 if non-diabetic, 1 if diabetic)
Exploring the dataset
We will explore the dataset using JupyterLab and the pandas library.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
import joblib
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.decomposition import PCA
Importing the dataset
We will import the dataset pima-indians-diabetes.data.csv from GitHub into a pandas dataframe and show the first few records.
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
diabetesDF = pd.read_csv(url, names=names)
diabetesDF.head(2)
diabetesDF.info()
Data Exploration
Let us now explore our dataset to get a feel of what it looks like and get some understanding about it.
Pandas dataframe.corr() is used to find the pairwise correlation of all columns in a dataframe. Any NA values are automatically excluded, and any non-numeric column is ignored, so encode all categorical values as numbers before finding correlations between attributes in the dataset.
dataframe.corr() supports 3 correlation methods: Pearson, Kendall, and Spearman.
We'll detail them in another blog. For now, rather than using a bare df.corr(), specify which method you are calling so the computation isn't a full black box.
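As a quick illustration of how the methods differ, here is a minimal sketch on toy data (the toy frame and its column names are invented for this example). Pearson captures linear association, while Spearman and Kendall are rank-based and capture any monotonic relationship:

```python
import numpy as np
import pandas as pd

# Toy data: y is a monotonic but non-linear function of x
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
df = pd.DataFrame({"x": x, "y": x ** 3})

# Pearson measures linear association; Spearman and Kendall
# measure monotonic (rank-based) association.
pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]
kendall = df.corr(method="kendall").loc["x", "y"]

# Spearman and Kendall are exactly 1.0 here (strictly monotonic data,
# no ties), while Pearson is below 1.0 because the relation isn't linear.
print(pearson, spearman, kendall)
```

The gap between the three numbers is exactly why naming the method matters: each answers a different question about the data.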
Seaborn lets us visualize the correlation between attributes using a heatmap. A picture tells a better story than numbers :)
corr = diabetesDF.corr(method ='pearson')
print(corr)
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(corr,
            xticklabels=corr.columns,
            yticklabels=corr.columns,
            annot=True, cmap='RdBu_r', linewidths=0.5, ax=ax)
In the above chart, the redder a box, the stronger that pair's positive correlation.
From the chart, blood glucose level is the most important contributing factor to diabetes in Pima women, followed by BMI, age, and number of pregnancies. Pima women get gestational diabetes (diabetes during pregnancy); what could be the contributing factor? Per the data, getting pregnant at an older age seems to contribute more.
Also notice the correlation between pairs of features, like age and pregnancies, or insulin and skin thickness.
From the above figure, we can draw the following conclusions.
1) Diabetes pedigree function, BMI, and pregnancies have a significant influence on the model. It is good to see our machine learning model match what we have been hearing from doctors our entire lives!
2) Blood pressure has a negative influence on the prediction.
3) More pregnancies can possibly result in diabetes.
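The conclusions above can also be read off numerically by ranking predictors by their absolute correlation with the target. A minimal sketch, using synthetic stand-in data (real analysis would use diabetesDF; the values here are invented and only the short column names match the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the dataset's column names;
# the target is loosely driven by glucose ('plas') so it
# should show the strongest correlation.
rng = np.random.default_rng(0)
n = 300
plas = rng.normal(120, 30, n)
mass = rng.normal(32, 6, n)
age = rng.normal(33, 10, n)
cls = (plas + rng.normal(0, 30, n) > 130).astype(int)
df = pd.DataFrame({"plas": plas, "mass": mass, "age": age, "class": cls})

# Rank predictors by absolute Pearson correlation with the target
ranking = (df.corr(method="pearson")["class"]
             .drop("class")
             .abs()
             .sort_values(ascending=False))
print(ranking)  # 'plas' comes out on top by construction
```

On the real dataframe the same three lines give the exact ordering the heatmap only hints at.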
Let's also look at how many people in the dataset are diabetic and how many are not.
sns.countplot(x = 'class', data = diabetesDF, palette = 'magma')
plt.title('People with Diabetes/without Diabetes')
plt.show()
Average age of people having diabetes vs without diabetes
sns.barplot(x='class', y='age', data=diabetesDF,
            palette='hls', capsize=0.05, saturation=8,
            errcolor='gray', errwidth=2, ci='sd')
To understand similarities and differences between diabetic and non-diabetic patients across the dataset, we can use a pairplot.
sns.pairplot(diabetesDF,hue='class')
Dataset Preparation
When using machine learning algorithms to build a model, we should always split our data into a training set and a test set so that we can detect overfitting. The dataset consists of records for 768 users in total; we will hold out 50% of it for testing and train the model on the remaining 50%.
# Split dataset into training set and test set
array = diabetesDF.values
X = array[:,0:8]
Y = array[:,8]
test_size = 0.50
seed = 105
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)
# Scale the data
scaler = preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Apply dimensionality reduction
pca = PCA(n_components=3)
model = pca.fit(X_train_scaled)
X_train_dim_red = model.transform(X_train_scaled)
#Test data
X_test_dim_red = pca.transform(X_test_scaled)
model.explained_variance_ratio_
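Choosing n_components=3 is a judgment call; the cumulative explained variance curve helps verify how much information three components retain. A minimal sketch on synthetic stand-in data (in practice you would fit on X_train_scaled; the data here is invented):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 scaled predictors,
# with one deliberately correlated pair
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
X[:, 1] = X[:, 0] + rng.normal(scale=0.1, size=500)
X_scaled = StandardScaler().fit_transform(X)

# Keep all components so the full curve is visible
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)  # pick n_components where the curve flattens
```

If the first three entries already cover most of the variance, n_components=3 is justified; otherwise it is worth keeping more components.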
Extracting the most important features from the PCA output
n_pcs= model.components_.shape[0]
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
initial_feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',]
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
df = pd.DataFrame(dic.items())
df
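The trick above works because each row of components_ holds the loadings of the original features on one principal component, so the argmax of the absolute loadings names the feature that dominates that component. A minimal sketch where one feature is constructed to dominate (the feature names here are invented):

```python
import numpy as np
from sklearn.decomposition import PCA

# Three features; 'a' has much larger variance than the others,
# so it should dominate the first principal component.
rng = np.random.default_rng(2)
X = np.column_stack([
    rng.normal(scale=10.0, size=400),  # a: high variance
    rng.normal(scale=1.0, size=400),   # b
    rng.normal(scale=1.0, size=400),   # c
])
feature_names = ["a", "b", "c"]

# Deliberately unscaled, so raw variance drives the components
pca = PCA(n_components=2).fit(X)
dominant = [feature_names[np.abs(row).argmax()] for row in pca.components_]
print(dominant)  # PC1's dominant loading is 'a'
```

Note this only names the single strongest loading per component; a component can still mix several features.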
#Creating the model
diabetesCheck = LogisticRegression()
diabetesCheck.fit(X_train_dim_red, Y_train)
accuracy = diabetesCheck.score(X_test_dim_red, Y_test)
print(f'Accuracy of the model: {round(accuracy * 100, 2)}%')
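Accuracy alone can be misleading on an imbalanced dataset like this one (roughly twice as many non-diabetic as diabetic records), so it is worth also checking the confusion matrix and per-class metrics. A sketch on synthetic data (in practice you would pass Y_test and diabetesCheck.predict(X_test_dim_red); everything below is a stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dimensionality-reduced data
X, y = make_classification(n_samples=400, n_features=3, n_informative=3,
                           n_redundant=0, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=7)

clf = LogisticRegression().fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

cm = confusion_matrix(y_te, y_pred)  # rows: true class, cols: predicted
print(cm)
print(classification_report(y_te, y_pred))
```

The off-diagonal cells show exactly which class the model gets wrong, which matters here: missing a diabetic patient (a false negative) is the costlier mistake.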
#Saving the Model to disk
filename = 'diab_web.sav'
joblib.dump(diabetesCheck, filename)
Load the model back with joblib to check it was saved correctly
# load the model from disk
diabetesLoadedModel = joblib.load(filename)
accuracyModel = diabetesLoadedModel.score(X_test_dim_red, Y_test)
print("accuracy =", round(accuracyModel * 100, 2), "%", diabetesLoadedModel)
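One caveat before serving: only the classifier is saved above, but the Streamlit app will also need the fitted scaler and PCA to transform raw user input. A sklearn Pipeline bundles all three into a single joblib file. A minimal sketch with synthetic stand-in data (the data, file name, and variable names are invented):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the raw 8-feature training data
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
y = (X[:, 1] > 0).astype(int)

# Bundle scaling, PCA, and the classifier so the serving code only
# needs raw feature values, not the fitted scaler/PCA objects
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=3)),
    ("clf", LogisticRegression()),
]).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "diab_pipeline.sav")
joblib.dump(pipe, path)

# Reload and confirm the round-trip preserves predictions
loaded = joblib.load(path)
assert (loaded.predict(X) == pipe.predict(X)).all()
```

With this, the Streamlit app can call loaded.predict on raw user-entered measurements directly, instead of re-implementing the scaling and PCA steps by hand.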